An efficient classification algorithm for NGS data based on text similarity.

Internal identifier: 000A66 (Main/Exploration); previous: 000A65; next: 000A67

Authors: Xiangyu Liao [People's Republic of China]; Xingyu Liao [People's Republic of China]; Wufei Zhu [People's Republic of China]; Lu Fang [People's Republic of China]; Xing Chen [People's Republic of China]

Source:

Genetics research [1469-5073]; 2018.

RBID : pubmed:30221607

French descriptors

KwdFr :
- Algorithmes, Données de la recherche comme sujet, Humains, Rhodobacter sphaeroides (génétique), Séquençage nucléotidique à haut débit, Vibrio cholerae (génétique).
MESH :
- génétique : Rhodobacter sphaeroides, Vibrio cholerae.
- Algorithmes, Données de la recherche comme sujet, Humains, Séquençage nucléotidique à haut débit.

English descriptors

KwdEn :
- Algorithms, Data Analysis, Datasets as Topic, High-Throughput Nucleotide Sequencing, Humans, Mycobacterium abscessus (genetics), Rhodobacter sphaeroides (genetics), Vibrio cholerae (genetics).
MESH :
- genetics : Mycobacterium abscessus, Rhodobacter sphaeroides, Vibrio cholerae.
- Algorithms, Data Analysis, Datasets as Topic, High-Throughput Nucleotide Sequencing, Humans.

Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory use, disk space and running time. Clustering can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data, based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers and forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing that k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until no text satisfies the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC can cluster 100 million short reads within 2 hours and that it performs very well in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced, and the assembler can achieve better assemblies in terms of N50, NA50 and genome fraction.
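
The five steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define the similarity measure, so Jaccard similarity over read-ID sets is used here as an assumption, and the function names (`canonical`, `build_kmer_table`, `cluster_reads`) and the default values of `k` and `threshold` are hypothetical. The quadratic merge loop is only meant to show the logic on toy input, not to reproduce the reported performance.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return one representative for a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: hash table mapping each unique k-mer to the IDs of reads containing it."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: shared read IDs over all read IDs."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=4, threshold=0.5):
    """Steps 3-5: turn hash units into texts, merge similar texts, return clusters."""
    table = build_kmer_table(reads, k)
    # Step 3: each hash unit becomes a "short text" (a set of read IDs).
    # dict.fromkeys drops identical texts while keeping insertion order.
    texts = list(dict.fromkeys(frozenset(ids) for ids in table.values()))
    # Step 4: merge pairs above the threshold until no pair qualifies.
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    texts[i] = texts[i] | texts[j]
                    del texts[j]
                    merged = True
                    break
            if merged:
                break
    # Step 5: each remaining "long text" is reported as a cluster of read IDs.
    return sorted(sorted(text) for text in texts)


if __name__ == "__main__":
    # Reads 0 and 1 are reverse complements of each other; read 2 is unrelated.
    print(cluster_reads(["AAAACCCC", "GGGGTTTT", "ACGTACGT"]))
```

In this sketch a read whose k-mers fall into texts that never reach the threshold can appear in more than one cluster; how HSC resolves such cases is not stated in the abstract.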

DOI: 10.1017/S0016672318000058
PubMed: 30221607

Affiliations:

People's Republic of China

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">An efficient classification algorithm for NGS data based on text similarity.</title>
<author><name sortKey="Liao, Xiangyu" sort="Liao, Xiangyu" uniqKey="Liao X" first="Xiangyu" last="Liao">Xiangyu Liao</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Liao, Xingyu" sort="Liao, Xingyu" uniqKey="Liao X" first="Xingyu" last="Liao">Xingyu Liao</name>
<affiliation wicri:level="1"><nlm:affiliation>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083</wicri:regionArea>
<wicri:noRegion>Hunan 410083</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Zhu, Wufei" sort="Zhu, Wufei" uniqKey="Zhu W" first="Wufei" last="Zhu">Wufei Zhu</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Fang, Lu" sort="Fang, Lu" uniqKey="Fang L" first="Lu" last="Fang">Lu Fang</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Chen, Xing" sort="Chen, Xing" uniqKey="Chen X" first="Xing" last="Chen">Xing Chen</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2018">2018</date>
<idno type="RBID">pubmed:30221607</idno>
<idno type="pmid">30221607</idno>
<idno type="doi">10.1017/S0016672318000058</idno>
<idno type="wicri:Area/PubMed/Corpus">000780</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000780</idno>
<idno type="wicri:Area/PubMed/Curation">000780</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000780</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000A15</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000A15</idno>
<idno type="wicri:Area/Ncbi/Merge">001F65</idno>
<idno type="wicri:Area/Ncbi/Curation">001F65</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">001F65</idno>
<idno type="wicri:Area/Main/Merge">000A69</idno>
<idno type="wicri:Area/Main/Curation">000A66</idno>
<idno type="wicri:Area/Main/Exploration">000A66</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">An efficient classification algorithm for NGS data based on text similarity.</title>
<author><name sortKey="Liao, Xiangyu" sort="Liao, Xiangyu" uniqKey="Liao X" first="Xiangyu" last="Liao">Xiangyu Liao</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Oncology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Liao, Xingyu" sort="Liao, Xingyu" uniqKey="Liao X" first="Xingyu" last="Liao">Xingyu Liao</name>
<affiliation wicri:level="1"><nlm:affiliation>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>School of Information Science and Engineering,Central South University,Changsha,Hunan 410083</wicri:regionArea>
<wicri:noRegion>Hunan 410083</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Zhu, Wufei" sort="Zhu, Wufei" uniqKey="Zhu W" first="Wufei" last="Zhu">Wufei Zhu</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Fang, Lu" sort="Fang, Lu" uniqKey="Fang L" first="Lu" last="Fang">Lu Fang</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Chen, Xing" sort="Chen, Xing" uniqKey="Chen X" first="Xing" last="Chen">Xing Chen</name>
<affiliation wicri:level="1"><nlm:affiliation>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000,P.R. China.</nlm:affiliation>
<country xml:lang="fr" wicri:curation="lc">République populaire de Chine</country>
<wicri:regionArea>Department of Endocrinology,The First College of Clinical Medical Science,China Three Gorges University,Yichang Central People's Hospital,Yichang,Hubei 443000</wicri:regionArea>
<wicri:noRegion>Hubei 443000</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j">Genetics research</title>
<idno type="eISSN">1469-5073</idno>
<imprint><date when="2018" type="published">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Data Analysis</term>
<term>Datasets as Topic</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Humans</term>
<term>Mycobacterium abscessus (genetics)</term>
<term>Rhodobacter sphaeroides (genetics)</term>
<term>Vibrio cholerae (genetics)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Algorithmes</term>
<term>Données de la recherche comme sujet</term>
<term>Humains</term>
<term>Rhodobacter sphaeroides (génétique)</term>
<term>Séquençage nucléotidique à haut débit</term>
<term>Vibrio cholerae (génétique)</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en"><term>Mycobacterium abscessus</term>
<term>Rhodobacter sphaeroides</term>
<term>Vibrio cholerae</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr"><term>Rhodobacter sphaeroides</term>
<term>Vibrio cholerae</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Data Analysis</term>
<term>Datasets as Topic</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Humans</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Algorithmes</term>
<term>Données de la recherche comme sujet</term>
<term>Humains</term>
<term>Séquençage nucléotidique à haut débit</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has now begun to greatly challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time consumption. In addition, it also has good performance on reducing dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on efficient hash function and text similarity. First, HSC converts all reads into k-mers, then it forms a unique k-mer set by merging the duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field, and the ID of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text, the merge operation is executed iteratively until there is no text that satisfies the merge condition. Finally, the long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC cluster 100 million short reads within 2 hours, and it has excellent performance in reducing memory consumption. Compared to existing methods, HSC is much faster than other tools, it can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced. It can help the assembler to achieve better assemblies in terms of N50, NA50 and genome fraction.</div>
</front>
</TEI>
<affiliations><list><country><li>République populaire de Chine</li>
</country>
</list>
<tree><country name="République populaire de Chine"><noRegion><name sortKey="Liao, Xiangyu" sort="Liao, Xiangyu" uniqKey="Liao X" first="Xiangyu" last="Liao">Xiangyu Liao</name>
</noRegion>
<name sortKey="Chen, Xing" sort="Chen, Xing" uniqKey="Chen X" first="Xing" last="Chen">Xing Chen</name>
<name sortKey="Fang, Lu" sort="Fang, Lu" uniqKey="Fang L" first="Lu" last="Fang">Lu Fang</name>
<name sortKey="Liao, Xingyu" sort="Liao, Xingyu" uniqKey="Liao X" first="Xingyu" last="Liao">Xingyu Liao</name>
<name sortKey="Zhu, Wufei" sort="Zhu, Wufei" uniqKey="Zhu W" first="Wufei" last="Zhu">Wufei Zhu</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A66 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000A66 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     pubmed:30221607
   |texte=   An efficient classification algorithm for NGS data based on text similarity.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:30221607" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

MERS exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.